(54) Method of operating a data storage disk array.



(57) A method for updating data and parity information
in a RAID level 4 or 5 disk array employing a
read-modify-write (RMW) process separates the execution
of data read and write operations from the execution of
parity read, generation and write operations to permit
greater efficiency in the utilization of the drives
within the array. The method identifies the disk drives
containing the data and parity to be updated and places
the proper read and write requests into the I/O queues
for the identified data and parity drives. Parity
operations (reading old parity information from the
parity drive, generating new parity information and
writing the new parity information to the parity drive)
are scheduled for execution when best accommodated in
the I/O queue for the parity drive, following the read
of old data from the data drive.



Jouve, 18, rue Saint-Denis, 75001 PARIS 






EP 0 594 464 A2 






This invention relates to a method of operating a 
data storage disk array. 

RAID (Redundant Array of Inexpensive Disks) 
storage systems have emerged as an alternative to 
large, expensive disk drives for use within present 
and future computer system architectures. A RAID 
storage system includes an array of small, inexpen- 
sive hard disk drives, such as the 5 1/4 or 3 1/2 inch 
disk drives currently used in personal computers and 
workstations. Although disk array products have 
been available for several years, significant improve- 
ments in the reliability and performance of small disk 
drives and a decline in the cost of such drives have 
resulted in the recent enhanced interest in RAID
systems.

Current disk array design alternatives are described
in an article titled "A Case for Redundant Arrays of
Inexpensive Disks (RAID)" by David A. Patterson, Garth
Gibson and Randy H. Katz; University of California
Report No. UCB/CSD 87/391, December 1987. The article
discusses disk arrays and the improvements in
performance, reliability, power consumption and
scalability that disk arrays provide in comparison to
single large magnetic disks. Five disk array
arrangements, referred to as RAID levels, are described.
The simplest array, a RAID level 1 system, comprises one
or more disks for storing data and an equal number of
additional "mirror" disks for storing copies of the
information written to the data disks. The remaining
RAID levels, identified as RAID level 2, 3, 4 and 5
systems, segment the data into portions for storage
across several data disks. One or more additional disks
are utilized to store error check or parity information.
The present invention is applicable to improvements in
the operation of RAID level 4 and 5 systems.

A RAID level 4 disk array is comprised of N+1 
disks wherein N disks are used to store data, and the 
additional disk is utilized to store parity information. 
Data to be saved is divided into portions consisting of 
one or many blocks of data for storage among the 
disks. The corresponding parity information, which 
can be calculated by performing a bit-wise exclusive- 
OR of corresponding portions of the data stored 
across the N data drives, is written to the dedicated 
parity disk. The parity disk is used to reconstruct
information in the event of a disk failure. Writes typically
require access to two disks, i.e., one of the N data 
disks and the parity disk, as will be discussed in great- 
er detail below. Read operations typically need only 
access a single one of the N data disks, unless the 
data to be read exceeds the block length stored on 
each disk. 

RAID level 5 disk arrays are similar to RAID level 4
systems except that parity information, in addition to
the data, is distributed across the N+1 disks in each
group. Each one of the N+1 disks within the array
includes some blocks for storing data and some blocks
for storing parity information. The location of the
parity information is controlled by an algorithm
implemented by the user. As in RAID level 4 systems,
RAID level 5 writes typically require access to two
disks; however, no longer does every write to the array
require access to the same dedicated parity disk, as in
RAID level 4 systems. This feature provides the
opportunity to perform concurrent write operations.
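The distribution of parity across the drives can be sketched in a few lines of Python. The rotation rule used here (the parity block of stripe s is placed on drive s modulo the number of drives) is an assumption chosen purely for illustration; the text above leaves the placement algorithm to the implementer, and the function names `parity_drive`, `layout` and `drives_for_write` are hypothetical.

```python
def parity_drive(stripe, num_drives):
    """Hypothetical placement rule: rotate parity one drive per stripe."""
    return stripe % num_drives


def layout(num_stripes, num_drives):
    """Return one row per stripe, listing what each drive holds."""
    rows = []
    block = 0
    for stripe in range(num_stripes):
        p = parity_drive(stripe, num_drives)
        row = []
        for drive in range(num_drives):
            if drive == p:
                row.append(f"PARITY {stripe}")
            else:
                row.append(f"BLOCK {block}")
                block += 1
        rows.append(row)
    return rows


def drives_for_write(rows, label):
    """Indices of the two drives a write to `label` must access."""
    for stripe, row in enumerate(rows):
        if label in row:
            return {row.index(label), row.index(f"PARITY {stripe}")}
    raise KeyError(label)


rows = layout(num_stripes=4, num_drives=5)
for row in rows:
    print(" | ".join(f"{cell:9}" for cell in row))

# Two writes whose data and parity drives do not overlap could proceed
# concurrently under this assumed layout.
first = drives_for_write(rows, "BLOCK 0")
second = drives_for_write(rows, "BLOCK 10")
print(first.isdisjoint(second))
```

With a dedicated parity drive, as in RAID level 4, the two sets would always share the parity drive, which is why concurrent writes become possible only once parity is distributed.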

A RAID level 5 system including five data and parity
disk drives, DRIVE A through DRIVE E, and a spare disk
drive, DRIVE F, is illustrated in Figure 1. An array
controller 100 coordinates the transfer of data between
the host system 147 and the array disk drives. The
controller also calculates and checks parity
information. Blocks 145A through 145E illustrate the
manner in which data and parity are stored on the five
array drives. Data blocks are identified as BLOCK 0
through BLOCK 15. Parity blocks are identified as
PARITY 0 through PARITY 3. The relationship between the
parity and data blocks is as follows:

PARITY 0 = (BLOCK 0) XOR (BLOCK 1) XOR (BLOCK 2) XOR (BLOCK 3)
PARITY 1 = (BLOCK 4) XOR (BLOCK 5) XOR (BLOCK 6) XOR (BLOCK 7)
PARITY 2 = (BLOCK 8) XOR (BLOCK 9) XOR (BLOCK 10) XOR (BLOCK 11)
PARITY 3 = (BLOCK 12) XOR (BLOCK 13) XOR (BLOCK 14) XOR (BLOCK 15)
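As an illustration of the parity relationships above, the following minimal Python sketch computes PARITY 0 from four data blocks and then rebuilds one lost block from the survivors and the parity. The four-byte block contents and the helper name `xor_blocks` are assumptions made for the example; they do not come from the disclosure.

```python
from functools import reduce


def xor_blocks(blocks):
    """Bit-wise exclusive-OR of equally sized byte blocks."""
    if not blocks:
        raise ValueError("at least one block is required")
    size = len(blocks[0])
    if any(len(b) != size for b in blocks):
        raise ValueError("all blocks must have the same length")
    # zip(*blocks) yields one tuple of byte values per byte position.
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


# Illustrative contents for BLOCK 0 through BLOCK 3.
stripe = [
    bytes([0x11, 0x22, 0x33, 0x44]),
    bytes([0x55, 0x66, 0x77, 0x88]),
    bytes([0x99, 0xAA, 0xBB, 0xCC]),
    bytes([0xDD, 0xEE, 0xFF, 0x01 ^ 0x01 ^ 0x04]),
]
stripe[3] = bytes([0x01, 0x02, 0x03, 0x04])

# PARITY 0 = BLOCK 0 XOR BLOCK 1 XOR BLOCK 2 XOR BLOCK 3
parity0 = xor_blocks(stripe)

# Suppose the drive holding BLOCK 2 fails: XOR of the surviving data
# blocks and the parity block recovers the missing contents.
survivors = [stripe[0], stripe[1], stripe[3], parity0]
rebuilt = xor_blocks(survivors)
print(rebuilt == stripe[2])
```

Reconstruction works because XOR-ing a value with itself cancels it, so every surviving block drops out of the combined result and only the missing block remains.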

As stated above, parity data can be calculated by
performing a bit-wise exclusive-OR of corresponding
portions of the data stored across the N data drives.
However, because each parity bit is simply the
exclusive-OR product of all the corresponding data bits
from the data drives, new parity can be more easily
determined from the old data and the old parity as well
as the new data in accordance with the following
equation:

new parity = (old data XOR new data) XOR old parity.
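The equivalence between this shortcut and a full recomputation across all data drives can be checked with a short Python sketch. The block size, the random contents and the function names `rmw_parity` and `full_parity` are assumptions for illustration only.

```python
import os


def xor_bytes(a, b):
    """Bit-wise exclusive-OR of two equally sized byte strings."""
    if len(a) != len(b):
        raise ValueError("blocks must have the same length")
    return bytes(x ^ y for x, y in zip(a, b))


def rmw_parity(old_data, new_data, old_parity):
    """new parity = (old data XOR new data) XOR old parity."""
    return xor_bytes(xor_bytes(old_data, new_data), old_parity)


def full_parity(blocks):
    """Parity recomputed from every data block in the stripe."""
    parity = bytes(len(blocks[0]))  # all-zero starting value
    for block in blocks:
        parity = xor_bytes(parity, block)
    return parity


# Four data blocks with arbitrary contents, and their parity.
blocks = [os.urandom(16) for _ in range(4)]
old_parity = full_parity(blocks)

# Update one block using only its old contents and the old parity.
new_block = os.urandom(16)
shortcut = rmw_parity(blocks[1], new_block, old_parity)

blocks[1] = new_block
print(shortcut == full_parity(blocks))
```

The shortcut touches only the block being written and the parity block, whatever the number of data drives, which is why a small write costs two reads and two writes rather than a read of the whole stripe.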

Although the parity calculation for RAID levels 4 or 5
shown in the above equation is much simpler than
performing a bit-wise exclusive-OR of corresponding
portions of the data stored across all of the data
drives, a typical RAID level 4 or 5 write operation will
require a minimum of two disk reads and two disk writes.
More than two disk reads and writes are required for
data write operations involving more than one data
block. Each individual disk read operation involves a
seek and rotation to the appropriate disk track and
sector to be read. The seek time for all disks is
therefore the maximum of the seek times of each disk. A
RAID level 4 or 5 system thus carries a significant
write penalty when compared with a single disk storage
device or with RAID level 1, 2 or 3 systems.

It is an object of the present invention to provide a
method of efficiently operating a data storage disk
array.

Therefore, according to the present invention 
there is provided a method of operating a data storage
disk array including first and second disk drives
in response to the receipt of new data from a host sys- 
tem, characterized by the steps of: reading old data 
from said first drive and saving said old data to a first 
storage buffer; writing said new data to a second stor- 
age buffer; replacing said old data residing on said 
first drive with said new data; generating new parity 
information after the conclusion of said reading step; 
and replacing said old parity information residing on 
said second drive with said new parity information. 

One embodiment of the present invention will 
now be described by way of example, with reference 
to the accompanying drawings, in which:- 

Figure 1 is a block diagram representation of a RAID
level 5 array including six disk drives.

Figures 2A and 2B illustrate in block diagram form one
possible architecture for disk array controller 100
shown in Figure 1.

Figure 3 is a block diagram illustration of the logic
included within Bus Switch block 400U shown in Figures
2A and 2B.

Figures 4 and 5 illustrate a RAID level 5
read-modify-write operation.

Figures 6 through 9 illustrate the modified RAID 
level 5 write operation wherein parity write operations 
are delayed in accordance with the method of the 
present invention. 

Referring now to Figures 2A and 2B, the architecture of
a disk array controller 100 for a RAID system is shown
in block diagram form. The array controller coordinates
the operation of the multitude of disk drives within the
array to perform read and write functions, parity
generation and checking, and data restoration and
reconstruction. The controller exchanges data with the
host computer system (not shown) through Host Interface
and CRC Logic block 200. Host I/F Logic block 200, under
the control of processor 101, interfaces an external
18-bit or 36-bit wide SCSI-2 bus 107 associated with the
host system with four internal 18-bit wide buffer busses
ABUF, BBUF, CBUF and DBUF. Bus 107 connects to Host I/F
Logic block 200 through a standard SCSI-2 chip set,
represented by blocks 109U and 109L and eighteen-bit
busses 111U and 111L. Interconnection between block 200
and processor 101 is provided by address/data bus 113.

Host I/F Logic block 200 operates to multiplex data
between SCSI-2 devices 109U and 109L and the four buffer
busses ABUF, BBUF, CBUF and DBUF. Block 200 provides
multiplexing functionality between busses 111U and 111L
and (1) all four buffer busses for 4 + 1 RAID level 3
and high bandwidth RAID level 5 applications by word
striping data across the four buffer busses in a
rotating sequential order, (2) one of two defined pairs
of buffer busses for 2 + 1 RAID level 3 applications by
word striping data across the pair of buffer busses in a
rotating sequential order, or (3) any one of the buffer
busses for RAID level 1 and single bus RAID level 5
applications.
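Word striping in a rotating sequential order can be pictured with the following Python sketch, which deals successive words onto the buffer busses in turn and then reassembles them. It models only the ordering, not the hardware; the function names `stripe_words` and `merge_words` are hypothetical, and the choice of ABUF and BBUF as a bus pair is an assumption, since the text does not say which pairs are defined.

```python
BUSES = ("ABUF", "BBUF", "CBUF", "DBUF")


def stripe_words(words, buses=BUSES):
    """Deal successive words onto the busses in rotating order."""
    lanes = {bus: [] for bus in buses}
    for i, word in enumerate(words):
        lanes[buses[i % len(buses)]].append(word)
    return lanes


def merge_words(lanes, buses=BUSES):
    """Inverse of stripe_words: reassemble the original word sequence."""
    total = sum(len(lanes[bus]) for bus in buses)
    return [lanes[buses[i % len(buses)]][i // len(buses)] for i in range(total)]


words = list(range(6))

# Case (1): all four buffer busses.
four_way = stripe_words(words)
print(four_way)

# Case (2): a pair of buffer busses (the pair chosen here is assumed).
pair = ("ABUF", "BBUF")
two_way = stripe_words(words, pair)
print(two_way)

print(merge_words(four_way) == words, merge_words(two_way, pair) == words)
```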

Internal buffer busses ABUF, BBUF, CBUF and DBUF connect
Host I/F Logic block 200 with a RAM buffer 120 and upper
and lower byte bus switches 400U and 400L. Buffer 120
possesses the capability to read and write 72-bit wide
words from the four buffer busses, or individual 18-bit
wide words from any one of the buffer busses. Eighteen
or 36-bit access is also provided through transceivers
115 to bus 113.

Bus switches 400U and 400L provide variable bus mapping
between buffer busses ABUF, BBUF, CBUF and DBUF and six
18-bit wide drive busses labeled ADRV, BDRV, CDRV, DDRV,
EDRV and FDRV, each switch providing routing for one
byte (eight bits data and one bit parity) of
information. Bus switches 400U and 400L further include
the capability to generate parity information, which may
be directed onto any of the buffer or drive busses,
check parity information and reconstruct information
stored on a failed disk drive. Figure 3, discussed
below, provides greater detail concerning the
construction and operation of bus switches 400U and
400L.

Each one of drive busses ADRV, BDRV, CDRV, DDRV, EDRV
and FDRV is connected to an associated SCSI-2 device,
labeled 130A through 130F, which provide connection to
six corresponding disk drives (not shown) forming the
disk array. The six drives will be identified herein as
drives A through F. Reed-Solomon Cyclic Redundancy Check
(RSCRC) logic blocks 500AB, 500CD and 500EF are
connected between busses ADRV and BDRV, CDRV and DDRV,
and EDRV and FDRV, respectively, to provide error
detection and generation of Reed-Solomon CRC for the
array controller.

The control of Host I/F Logic block 200; bus switches
400U and 400L; RSCRC logic blocks 500AB, 500CD and
500EF; and SCSI devices 109U, 109L, and 130A through
130F is provided by microprocessor 101. Communication
between microprocessor 101, associated processor memory
103 and processor control inputs 105 and the
above-identified elements is provided by address/data
bus 113. Also shown connected to bus 113 is DMA Control
Logic block 300. The logic within block 300 provides DMA
control for Host I/F Logic block 200, bus switches 400U
and 400L, SCSI-2 devices 130A through 130F and processor
101.

The controller architecture shown in Figures 2A and 2B
can be configured to accommodate different quantities of
disk drives and also to accommodate different RAID
configurations.

The logic included within each one of bus switches 400U
and 400L is shown in the block diagram of Figure 3. The
structure shown is formed upon a single semiconductor
chip. The four host ports, labeled 481 through 484,
provide connection to the four controller busses ABUF,
BBUF, CBUF and DBUF, respectively. The array ports,
identified by reference numerals 491 through 496,
connect with the six disk drive busses ADRV, BDRV, CDRV,
DDRV, EDRV and FDRV, respectively. Bus switches 400U and
400L operate together to provide a unidirectional
connection between any one of controller busses ABUF,
BBUF, CBUF and DBUF and any one of drive busses ADRV,
BDRV, CDRV, DDRV, EDRV and FDRV. Multiple connections
between several controller busses and an equal number of
drive busses are also permitted. Additionally, the bus
switches may provide unidirectional connection of any
controller bus to two or more drive busses. Parity
information obtained via bus 453 can also be ported to
any one of the drive busses.

The architecture of each bus switch is composed of three
primary blocks: a latch module 450, a switch module 460,
and a parity module 470. Switch module 460 is connected
between controller busses ABUF, BBUF, CBUF and DBUF and
drive busses ADRV, BDRV, CDRV, DDRV, EDRV and FDRV. An
additional bus 453 connects parity module 470 to bus
switch module 460. Several functions are provided by bus
switch module 460. First, bus switch module 460 provides
a unidirectional connection between any controller bus
and any drive bus. Multiple connections between several
controller busses and an equal number of drive busses
are also permitted.

Second, the bus switch module provides connection
between any two or more of the drive busses. Such an
operation is necessary for the transfer of information
between disk drives without interfering with host or
controller operations.

Third, bus switch module 460 provides connection between
any two or more of the controller busses. This mode of
operation supports data reorganization on the controller
by allowing data to be propagated from one controller
bus to another. This mode of turnaround operation is
also advantageous for BIST (Built-in Self Test)
development.

Finally, the bus switch module provides unidirectional
connection of any controller bus to one or more drive
busses. Parity information obtained via bus 453 can also
be ported to any one of the drive busses.

Parity module 470 includes connections to each of the
controller busses for receiving data therefrom and a
connection to bus 453 for providing parity information
to bus switch module 460. Parity module 470 generates
parity information for RAID level 3, 4 and 5 operations
by performing a bit-wise exclusive-OR of each active
controller bus. The parity information is provided to
bus switch module 460 via bus 453.

Figures 4 and 5 illustrate a RAID level 5 write
involving DRIVE A and DRIVE B, wherein data is to be
written to DRIVE B and parity information is to be
updated on DRIVE A. Only structure required to
facilitate the read-modify-write (RMW) operation is
shown in Figures 4 and 5.

Under direction of the controller processor, not shown,
old data and parity information are first read from the
two drives as shown in Figure 4. The old data and parity
are read from the target areas within drives DRIVE B and
DRIVE A, respectively, and routed via buses 135B and
135A to bus switch 400. Bus switch 400 is configured to
combine the received data and parity to generate the
exclusive-OR product: old data XOR old parity. This
product is stored in a first area 120D within buffer
120. New data received from host system 147 is
concurrently saved to a second area 120A within buffer
120.

New data and parity information are then written to
DRIVE B and DRIVE A as shown in Figure 5. Bus 
switch 400 is reconfigured to route the new data read 
from area 120A in storage buffer 120 to DRIVE B. Bus 
switch 400 is further configured to generate new par- 
ity information by combining the new data with the 
previously saved product, old data XOR old parity, 
stored in storage buffer area 120D. The result, old 
data XOR old parity XOR new data, is written to 
DRIVE A. 
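The two-step sequence of Figures 4 and 5 can be modeled in Python as follows. This is a simplified sketch: each drive is reduced to a single block held in a dictionary, buffer areas 120A and 120D become dictionary keys, and a third data drive, DRIVE C, is added as an assumption so that the parity has something else to protect. The function name `conventional_rmw` is hypothetical.

```python
def xor_bytes(a, b):
    """Bit-wise exclusive-OR of two equally sized byte strings."""
    if len(a) != len(b):
        raise ValueError("blocks must have the same length")
    return bytes(x ^ y for x, y in zip(a, b))


def conventional_rmw(drives, buffer, data_drive, parity_drive, new_data):
    # Step 1 (Figure 4): old data and old parity are read together and
    # their XOR product is saved to area 120D; new data goes to 120A.
    buffer["120D"] = xor_bytes(drives[data_drive], drives[parity_drive])
    buffer["120A"] = new_data
    # Step 2 (Figure 5): new data is written to the data drive, and
    # new data XOR (old data XOR old parity) to the parity drive.
    drives[data_drive] = buffer["120A"]
    drives[parity_drive] = xor_bytes(buffer["120A"], buffer["120D"])


# DRIVE A holds parity for the blocks on DRIVE B and DRIVE C.
data_b = b"\x0f\x0f"
data_c = b"\xf0\x01"
drives = {"A": xor_bytes(data_b, data_c), "B": data_b, "C": data_c}
buffer = {}

conventional_rmw(drives, buffer, data_drive="B", parity_drive="A",
                 new_data=b"\xaa\x55")
print(drives["A"] == xor_bytes(drives["B"], drives["C"]))
```

Because step 1 needs both drives at once, neither step can start until DRIVE A and DRIVE B are both available, which is the coupling the modified write operation removes.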

Figures 6 through 9 illustrate the modified RAID 
level 5 write operation wherein parity write operations 
are delayed in accordance with the method of the 
present invention. As with Figures 4 and 5, only
structure required to facilitate the write operation is shown
in Figures 6 through 9. 

Figure 6 illustrates the first step in the modified RAID
level 5 write operation. During this first step old data
received from disk storage and new data received from
the host system are saved to buffer 120. In Figure 6,
the new data received from host system 147 is directed
through host I/F logic 200 and buffer bus ABUF to a
first storage area 120A within buffer 120. Old data is
read from DRIVE B and routed via drive buses 135B and
BDRV, bus switch 400 and buffer bus DBUF to a second
storage area 120C within buffer 120. Bus switch 400 is
thereafter reconfigured and the new data stored in
buffer 120 is written to DRIVE B as shown in Figure 7.

Old parity is read from Drive A and written to area 
120D within buffer 120 as shown in Figure 8. This 
step may be performed concurrently with, or at any
point after, the operation shown in Figure 6 wherein 
old data is read from DRIVE B and saved to storage 
buffer 120. Bus switch 400 is thereafter configured to 
generate new parity information by combining the 
new data from storage area 120A, old data from stor- 
age area 120C and the old parity from storage area 
120D as shown in Figure 9. The result, old data XOR 
old parity XOR new data, is written to DRIVE A. The 
new parity write operation shown in Figure 9 may be 
performed immediately upon the conclusion of the 
old parity read operation shown in Figure 8, or may be 
delayed for execution at a more suitable time. 
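The four operations of Figures 6 through 9 can be sketched in Python as separate functions, so that the data update and the parity update are visibly decoupled. As before, each drive is reduced to a single block, the buffer areas 120A, 120C and 120D are dictionary keys, DRIVE C is an assumed additional data drive, and all function names are hypothetical.

```python
def xor_bytes(a, b):
    """Bit-wise exclusive-OR of two equally sized byte strings."""
    if len(a) != len(b):
        raise ValueError("blocks must have the same length")
    return bytes(x ^ y for x, y in zip(a, b))


def stage_new_and_read_old_data(drives, buf, new_data):
    """Figure 6: new data to area 120A, old data from DRIVE B to 120C."""
    buf["120A"] = new_data
    buf["120C"] = drives["B"]


def write_new_data(drives, buf):
    """Figure 7: new data is written to DRIVE B."""
    drives["B"] = buf["120A"]


def read_old_parity(drives, buf):
    """Figure 8: old parity from DRIVE A to area 120D."""
    buf["120D"] = drives["A"]


def write_new_parity(drives, buf):
    """Figure 9: old data XOR old parity XOR new data to DRIVE A."""
    product = xor_bytes(xor_bytes(buf["120C"], buf["120D"]), buf["120A"])
    drives["A"] = product


data_b = b"\x0f\x0f"
data_c = b"\xf0\x01"
drives = {"A": xor_bytes(data_b, data_c), "B": data_b, "C": data_c}
buf = {}

# The data drive is updated without touching DRIVE A.
stage_new_and_read_old_data(drives, buf, b"\xaa\x55")
write_new_data(drives, buf)
parity_stale_after_data_write = (
    drives["A"] != xor_bytes(drives["B"], drives["C"])
)

# The parity operations run later, whenever DRIVE A can take them.
read_old_parity(drives, buf)
write_new_parity(drives, buf)
parity_consistent_at_end = (
    drives["A"] == xor_bytes(drives["B"], drives["C"])
)
print(parity_stale_after_data_write, parity_consistent_at_end)
```

The interval in which the parity is stale is the price of the decoupling: old data must stay in area 120C, and the stripe must be remembered as pending, until the parity write completes.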

Separating the DRIVE B (data) and DRIVE A (parity) read
and write operations allows for more efficient
utilization of the disk drives. In the RMW procedure
shown in Figures 4 and 5 and discussed above, a delay in
obtaining access to either the parity or data drives
delays the entire RMW operation. In the modified
procedure, by contrast, data on DRIVE B is updated as
soon as the drive is available; the update is not
delayed in the event the parity drive, DRIVE A, is
unavailable. Similarly, DRIVE A operations will not be
stayed should DRIVE B be unavailable.

By delaying the parity read and write operations 
involving DRIVE A, the method of the present inven- 
tion permits utilization of DRIVE A for other input/out- 
put operations until such time as the parity read, gen- 
erate and write operations (Figures 8 and 9) can pro- 
ceed efficiently without inducing disk service time 
penalties. 

Scheduling of disk read and write operations is
coordinated by the array controller, which maintains
separate I/O queues for each drive within the array. The
method identifies the disk drives containing the data
and parity to be updated, drives DRIVE B and DRIVE A in
the example described above, and places the proper read
and write requests into the I/O queues for the
identified data and parity drives. Parity operations
(reading old parity information from DRIVE A, generating
new parity information and writing the new parity
information to DRIVE A) are scheduled for execution when
best accommodated in the I/O queue for DRIVE A,
following the read of old data from DRIVE B.
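A toy model of the per-drive I/O queues may help to picture the scheduling. The Python sketch below is an assumption-laden illustration, not the controller's actual policy: requests are simply appended to the tail of each drive's queue, every drive services at most one request per round, and a request runs only once the requests it depends on have finished. The class names `Request` and `ArrayController` and the request labels are hypothetical.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    name: str
    depends_on: tuple = ()


class ArrayController:
    def __init__(self, drive_names):
        # One I/O queue per drive.
        self.queues = {name: deque() for name in drive_names}
        self.done = []

    def submit_write(self, data_drive, parity_drive, tag):
        read_old = Request(f"{tag}: read old data")
        write_new = Request(f"{tag}: write new data", (read_old.name,))
        read_par = Request(f"{tag}: read old parity")
        # New parity needs both the old data and the old parity.
        write_par = Request(f"{tag}: write new parity",
                            (read_old.name, read_par.name))
        self.queues[data_drive].extend([read_old, write_new])
        self.queues[parity_drive].extend([read_par, write_par])

    def step(self):
        """One service round: each drive runs its head request if ready."""
        finished = []
        for drive, queue in self.queues.items():
            if queue and all(d in self.done for d in queue[0].depends_on):
                finished.append((drive, queue.popleft().name))
        self.done.extend(name for _, name in finished)
        return finished

    def run(self):
        rounds = []
        while any(self.queues.values()):
            finished = self.step()
            if not finished:
                raise RuntimeError("no request can make progress")
            rounds.append(finished)
        return rounds


ctrl = ArrayController(["A", "B"])
ctrl.queues["A"].append(Request("other: read"))  # DRIVE A is already busy
ctrl.submit_write(data_drive="B", parity_drive="A", tag="w1")
rounds = ctrl.run()
for number, finished in enumerate(rounds, start=1):
    print(number, finished)
```

In this run the data write on DRIVE B completes while DRIVE A is still working through its earlier request, and the parity write follows one round later.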

In addition, in order to minimize the overall re- 
sponse time seen by the host system upon issuing a 
write request, the modified write routine may include 
procedures for reporting write completion status to 
the host system just after the write of data to DRIVE 
B is completed, without waiting for the associated 
parity generation and write to DRIVE A to complete. 

To assure data reliability and integrity of parity in 
the array in the event of an array or drive failure, the 
array controller maintains a status table which identi- 
fies the pending parity blocks. Moreover, this status 
table should be placed in a safe secondary storage 
device, apart from the array controller, so as to sur- 
vive a controller failure and allow recovery. 
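The combination of early write completion and a table of pending parity blocks can be sketched as follows. Several points are assumptions made for the example: the table is a plain in-memory set rather than a safe secondary storage device, each stripe is a short list of data blocks, and recovery is performed by recomputing parity from the data blocks of every pending stripe, a strategy the text does not spell out. The class name `Array` and its methods are hypothetical.

```python
def xor_blocks(blocks):
    """Bit-wise exclusive-OR of equally sized byte blocks."""
    out = bytes(len(blocks[0]))
    for block in blocks:
        out = bytes(x ^ y for x, y in zip(out, block))
    return out


class Array:
    def __init__(self, stripes):
        self.stripes = stripes                        # data blocks per stripe
        self.parity = [xor_blocks(s) for s in stripes]
        self.pending = set()                          # stand-in status table

    def host_write(self, stripe, index, new_data):
        # Record the stripe as pending before its data changes.
        self.pending.add(stripe)
        self.stripes[stripe][index] = new_data
        # Completion is reported before the parity has been updated.
        return "write complete"

    def update_parity(self, stripe):
        """Deferred parity generation and write for one stripe."""
        self.parity[stripe] = xor_blocks(self.stripes[stripe])
        self.pending.discard(stripe)

    def recover(self):
        """After a failure, bring every pending stripe up to date."""
        for stripe in sorted(self.pending):
            self.update_parity(stripe)


arr = Array([[b"\x01", b"\x02"], [b"\x04", b"\x08"]])
status = arr.host_write(stripe=0, index=1, new_data=b"\xff")
pending_after_write = set(arr.pending)
stale_parity = arr.parity[0]

arr.recover()
print(status, pending_after_write, stale_parity, arr.parity[0], arr.pending)
```

Keeping the table outside the controller, as the text recommends, matters because an in-memory table like this one would be lost in exactly the failure it is meant to cover.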

It can thus be seen that there has been provided by the
present invention a method which improves the efficiency
of disk drive utilization within a disk array,
minimizing I/O service times and I/O queue waiting times
for individual drives within the disk array. Although a
RAID level 5 system including an array controller and
five disk drives for the storage of data and parity
information is shown in the Figures, and discussed
above, those skilled in the art will recognize that the
invention is not limited to the specific embodiment
described above and that numerous modifications and
changes are possible without departing from the scope of
the present invention. For example, the method may be
utilized to improve the performance of RAID level 4 and
other disk array systems. The method may also be
employed by the host system processor for those disk
array systems not including a dedicated array
controller.



Claims 

1. A method of operating a data storage disk array
including first and second disk drives in response to
the receipt of new data from a host system,
characterized by the steps of: reading old data from
said first drive and saving said old data to a first
storage buffer (120C); writing said new data to a second
storage buffer (120A); replacing said old data residing
on said first drive with said new data; generating new
parity information after the conclusion of said reading
step; and replacing said old parity information residing
on said second drive with said new parity information.

2. A method according to claim 1, characterized by the
step of: issuing a write complete status signal to said
host system upon the conclusion of said step of
replacing said old data residing on said first drive
with said new data.

3. A method according to claim 1 or 2, characterized in
that said step of generating new parity information
includes the steps of: reading old parity information
from said second drive into a third storage buffer
(120D) and combining said old parity information with
said old data and said new data stored within said first
(120C) and second (120A) storage buffers, respectively,
to generate new parity information, said new parity
information being the product: old data XOR new data XOR
old parity information.

4. A method according to any one of the preceding 
claims characterized in that said disk array in- 
cludes a RAID level 5 disk array. 
